Abstract — We show in this note that by deterministic packet
sampling, the tail of the distribution of the original flow size can 
be obtained by rescaling that of the sampled flow size. To recover 
information on the flow size distribution lost through packet 
sampling, we propose some heuristics based on measurements 
from different backbone IP networks. These heuristic arguments 
allow us to recover the complete flow size distribution. 

Index Terms — Packet sampling, flow statistics, Pareto distribution.

I. Introduction 

Packet sampling is an efficient method of reducing the 
amount of data to analyze when performing measurements 
in the Internet. The simplest and most popular packet
sampling technique consists of selecting one packet out of
every k packets. This technique is referred to as deterministic
1-out-of-k sampling in the technical literature and has notably
been implemented in Cisco routers [1]. Even if this sampling
scheme suffers from several drawbacks, identified for instance 
in [2], it is widely used in today's operational networks. 

The basic problem of packet sampling is that it is difficult 
to infer the original flow statistics from sampled data. Note 
that a flow is defined as the set of those packets sharing some 
common addressing information, typically the same source and 
destination IP addresses, the same source and destination port 
numbers together with the same protocol type. 
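
As a minimal illustration of the two notions above, the following Python sketch applies deterministic 1-out-of-k sampling to a toy packet stream and counts packets per flow. The `FlowKey` fields, the helper names and the toy addresses are assumptions made for illustration; they are not taken from any router implementation.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class FlowKey(NamedTuple):
    """Five-tuple identifying a flow (field names are illustrative)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


def sample_one_in_k(packets: Iterable[FlowKey], k: int) -> List[FlowKey]:
    """Keep every k-th packet of the stream (deterministic 1-out-of-k)."""
    return [pkt for idx, pkt in enumerate(packets, start=1) if idx % k == 0]


def flow_sizes(packets: Iterable[FlowKey]) -> Counter:
    """Number of packets observed per flow."""
    return Counter(packets)


# Toy stream: two interleaved flows (9 packets of flow_a, 3 of flow_b).
flow_a = FlowKey("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
flow_b = FlowKey("10.0.0.3", "10.0.0.2", 4321, 80, "tcp")
stream = [flow_a, flow_a, flow_b, flow_a] * 3

original = flow_sizes(stream)
sampled = flow_sizes(sample_one_in_k(stream, k=3))
```

In this toy stream the sampled sizes (3 and 1) are exactly the original sizes (9 and 3) divided by k = 3. With real traffic the sampled size of a flow fluctuates and small flows are often missed entirely, which is what makes the inversion difficult.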

Flow statistics inference from sampled data has been addressed
in previous studies. Duffield et al. [3], [4] study
the accuracy of different estimators based on multiplying
the sampled flow size by the sampling factor k, but their
method does not apply to the complete range of the flow size.
Hohn and Veitch [5] use generating function techniques to
invert the flow size distribution, but the proposed procedure is
numerically unstable. Mori et al. [6] use a Bayesian approach
to infer the characteristics of long flows.

In this paper, we develop a probabilistic approach to inverting
sampled traffic together with some heuristic arguments.
First, we note that when observing sampled traffic, we can only
compute the distribution of the random variable $\tilde v$ describing
the number of packets in sampled flows and $K_s$, the number
of sampled flows. If there are originally K flows, we have

$$\mathbb{P}(\tilde v = j) = \frac{1}{K_s} \sum_{i=1}^{K} \mathbb{1}_{ij}, \qquad j \ge 1, \tag{1}$$

where $\mathbb{1}_{ij} = 1$ if the ith flow has been sampled j times and
$\mathbb{1}_{ij} = 0$ otherwise.

Under some reasonable assumptions on the sampling process,
we show in this note that the tail of the original flow size
distribution can be obtained by rescaling that of
the sampled flow size. It is however not possible
to totally recover the original flow size distribution because
information on small or moderate flow sizes is lost through
sampling. To overcome this problem, we propose some heuristic
arguments based on measurements and exploiting a priori
information on flows. We consider here TCP traffic only.

The rest of this note is organized as follows. In Section II,
we make some reasonable assumptions on the sampling process.
In Section III, we prove that the tail of the original flow
size can be obtained by rescaling that of the sampled flow
size. In Section IV, we present some heuristic arguments to
recover the total flow size distribution. Concluding remarks
are presented in Section V.

II. Assumptions on the sampling process 

When observing traffic on a high speed link in a time window
of length $\Delta$, one may reasonably assume that the packets
of the different active flows are sufficiently interleaved. Hence,
one may suppose that the selection of packets among active flows
at a sampling time is random.

Moreover, in a time window of length $\Delta$, flows start and
finish and some of them may be silent (for instance in the
case of flows alternating between On and Off periods). In [7,
Section 3.3], it is shown that these fluctuations may be neglected
at the first order (i.e., when computing mean values) and it can
be assumed that flows are permanent. Under the two above
assumptions, we suppose that the probability of selecting a
packet of a given flow, say, flow i, is equal to $v_i/V_i$, where
$v_i$ is the size of flow i and $V_i$ is the total number of packets
arrived when flow i is active.

III. Tail of the sampled flow size

Let $W_j = \sum_{i=1}^{K} \mathbb{1}_{ij}$ be the number of flows sampled j times.

Proposition 1: If K flows are active during a time window
of length $\Delta$, the mean value $\mathbb{E}(W_j)$ satisfies

$$\left| \frac{\mathbb{E}(W_j)}{K} - Q_j \right| \le p\, \mathbb{E}\left(\frac{v^2}{V}\right), \tag{2}$$

where p = 1/k, v is the random number of packets in a flow,
V is the total number of packets arriving while that flow is active,
and Q is the probability distribution defined by

$$Q_j = \mathbb{E}\left(e^{-pv}\, \frac{(pv)^j}{j!}\right), \qquad j \ge 0.$$

Proof: Let us condition on the values of the set
$\mathcal{F} = \{v_1, \ldots, v_K, V_1, \ldots, V_K\}$. Under the assumptions of
Section II, the number of times that the ith flow is sampled is
equal to the sum

$$S_i = B^i_1 + B^i_2 + \cdots + B^i_{pV_i},$$

where $B^i_\ell$ is equal to one if the $\ell$th sampled packet is from
the ith flow, which event occurs with probability $v_i/V_i$. The
random variables $(B^i_\ell, \ell \ge 1)$ are i.i.d. Bernoulli random
variables and Le Cam's Inequality [8] then states

$$\left\| \mathbb{P}(S_i \in \cdot \mid \mathcal{F}) - Q_{\mathbb{E}(S_i)} \right\|_{tv} \le \sum_{\ell=1}^{pV_i} \left(\frac{v_i}{V_i}\right)^2 = p\, \frac{v_i^2}{V_i},$$

where $\|\cdot\|_{tv}$ is the total variation norm and $Q_{\mathbb{E}(S_i)}$ is a Poisson
distribution with mean $\mathbb{E}(S_i) = p v_i$ given $\mathcal{F}$. By deconditioning with
respect to the set $\mathcal{F}$, we have by using the distribution Q

$$\left\| \mathbb{P}(S_i \in \cdot) - Q \right\|_{tv} \le p\, \mathbb{E}\left(\frac{v_i^2}{V_i}\right).$$

In particular, for $j \in \mathbb{N}$, $|\mathbb{P}(S_i = j) - Q_j| \le p\, \mathbb{E}(v_i^2/V_i)$.
Since $\mathbb{E}(W_j) = \sum_{i=1}^{K} \mathbb{P}(S_i = j)$, summing on i yields
Equation (2). ■
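
The Poisson approximation used in the proof can be checked numerically. The sketch below, written for illustration only, computes the total variation distance between the conditional law of $S_i$ (a binomial with $pV_i$ trials and success probability $v_i/V_i$) and the Poisson law with mean $p v_i$, and compares it with the bound $p v_i^2/V_i$. The numerical values of p, v and V are hypothetical.

```python
import math


def binomial_pmf(n: int, q: float, j: int) -> float:
    """P(Binomial(n, q) = j)."""
    return math.comb(n, j) * q**j * (1.0 - q) ** (n - j)


def poisson_pmf(mean: float, j: int) -> float:
    """P(Poisson(mean) = j), evaluated in log space."""
    return math.exp(-mean + j * math.log(mean) - math.lgamma(j + 1))


def tv_distance(n: int, q: float) -> float:
    """sup_A |P(A) - Q(A)| between Binomial(n, q) and Poisson(n * q)."""
    mean = n * q
    l1 = sum(abs(binomial_pmf(n, q, j) - poisson_pmf(mean, j)) for j in range(n + 1))
    # The binomial puts no mass above n, so the Poisson tail counts in full.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(mean, j) for j in range(n + 1)))
    return 0.5 * (l1 + poisson_tail)


# Hypothetical values: p = 1/100, a flow of v = 200 packets seen among
# V = 100000 packets in total while it is active.
p, v, V = 0.01, 200, 100_000
n_samples = int(p * V)      # sampled packets while the flow is active
q = v / V                   # chance that a sampled packet belongs to the flow
distance = tv_distance(n_samples, q)
bound = p * v**2 / V        # right-hand side of Le Cam's inequality
```

For these values the bound equals 0.004, and the computed distance should fall below it, in line with the inequality.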
If K is sufficiently large, we have from Equation (1) and
the above proposition, for $j \ge 1$,

$$\mathbb{P}(\tilde v = j) \approx \frac{Q_j}{\nu},$$

where $\nu = K_s/K$ is the probability of sampling a flow.

Proposition 2: If all flows have a negligible contribution
to the total volume of traffic (i.e., $\mathbb{E}(v_i^2/V_i) \ll 1$ for all
i = 1, . . . , K), if K is sufficiently large, and if the flow size
distribution has a slowly varying tail, then when $j \to \infty$

$$\mathbb{P}(\tilde v \ge j) \sim \frac{1}{\nu}\, \mathbb{P}(v \ge j/p).$$

Proof: Consider the sum $\sum_{\ell \ge 1} e^{f(p,\ell)}\, \mathbb{P}(v = \ell)$, where
$f(p,x) = j \log x - px$, which is maximum at point j/p. By assuming
that the function $\ell \mapsto \mathbb{P}(v = \ell)$ is heavy tailed, Laplace method
gives for large j

$$\sum_{\ell \ge 1} e^{f(p,\ell)}\, \mathbb{P}(v = \ell) \sim e^{f(p,j/p)}\, \mathbb{P}(v = j/p) \sum_{n=-\infty}^{\infty} e^{f''(p,j/p)\, n^2/2},$$

where $f''(p,j/p) = -p^2/j$ and $e^{f(p,j/p)} = (j/p)^j e^{-j}$. If j is
sufficiently large, Stirling formula gives $j! \sim \sqrt{2\pi j}\, j^j e^{-j}$.
In addition, from [9], we have $\sum_{n=-\infty}^{\infty} e^{-a n^2} \sim \sqrt{\pi/a}$ when
$a \to 0$. This implies that $\sum_{n=-\infty}^{\infty} e^{-p^2 n^2/(2j)} \sim \sqrt{2\pi j}/p$.




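Combining these estimates with the definition of Q gives $Q_j \sim \mathbb{P}(v = j/p)/p$,
and then $\mathbb{P}(\tilde v = j) \sim \mathbb{P}(v = j/p)/(p\nu)$ when j is sufficiently large.
Since

$$\mathbb{P}(\tilde v \ge j) \sim \frac{1}{p\nu} \sum_{k=j}^{\infty} \mathbb{P}(v = k/p) \sim \frac{1}{\nu}\, \mathbb{P}(v \ge j/p),$$

the result follows. ■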
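
The Laplace step of the proof can be illustrated numerically. The sketch below assumes a discrete power-law flow size distribution (an assumption made here only for the check, with hypothetical parameters) and compares the Poisson mixture $Q_j = \mathbb{E}(e^{-pv}(pv)^j/j!)$ with the rescaled point probability $\mathbb{P}(v = j/p)/p$.

```python
import math
from typing import List


def power_law_pmf(max_size: int, shape: float) -> List[float]:
    """Discrete pmf with P(v = l) proportional to l**-(shape + 1), l = 1..max_size."""
    weights = [l ** -(shape + 1.0) for l in range(1, max_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def poisson_mixture(pmf: List[float], p: float, j: int) -> float:
    """Q_j = E[exp(-p v) (p v)^j / j!], with each term evaluated in log space."""
    log_j_factorial = math.lgamma(j + 1)
    return sum(
        math.exp(j * math.log(p * l) - p * l - log_j_factorial) * prob
        for l, prob in enumerate(pmf, start=1)
    )


# Hypothetical parameters: sampling rate p = 1/k, sampled size j, tail index.
k, j, shape = 10, 50, 1.5
p = 1.0 / k
pmf = power_law_pmf(max_size=20_000, shape=shape)

q_j = poisson_mixture(pmf, p, j)
rescaled = pmf[j * k - 1] / p      # P(v = j/p) / p, since j/p = j*k
ratio = q_j / rescaled             # expected to approach 1 as j grows
```

With these parameters the ratio should be within a few percent of one, and it gets closer to one when j is increased, which is the asymptotic regime covered by the proposition.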



IV. Heuristics for the total flow size distribution 

Proposition 2 shows that the tail of the complementary
cumulative distribution function (ccdf) of the original flow size 
can be obtained by rescaling that of the sampled flow size. We 
can however verify through examples that information on that 
distribution for small or moderate flow size values is lost. 

We exemplify this phenomenon by considering a 2 hour 
long real traffic trace from a 1 Gbit/s transmission link of 
the France Telecom IP backbone network carrying ADSL 
traffic. The original flow size is depicted in Figure 1(a)
and the deterministically sampled flow size in Figure 1(b),
which exhibits good agreement with the rescaled distribution
$\mathbb{P}(v \ge j/p)/\nu$ for sufficiently large j, as predicted
by Proposition 2. But all information for moderate values
of the flow size is contained in a few values, in this case
$\mathbb{P}(\tilde v \ge j)$ for j = 2, 3. The same phenomenon (see [10])
has been observed for an Abilene traffic trace available at 
http://pma.nlanr.net/Traces/Traces/long/ipls/3/, 
(b) Sampled size (p = 1/100).
Fig. 1. Flow size distribution in the France Telecom ADSL trace. 

In fact, through numerous experiments with real traffic
traces, it has been observed in [10] that $\mathbb{P}(v \ge j/p)$ can be
approximated by $\nu\, \mathbb{P}(\tilde v \ge j)$ when $j \ge j_0$ for some $j_0 > 0$.
The problem is then to estimate the quantities $\mathbb{P}(v = j)$ for
$j = 1, \ldots, j_0/p - 1$.

We have from Equation (7)

$$\mathbb{P}(\tilde v = j) \approx \frac{1}{\nu} \sum_{\ell \ge 1} e^{-p\ell}\, \frac{(p\ell)^j}{j!}\, \mathbb{P}(v = \ell),$$

and we know by Proposition 2 that for $j \ge j_0$, this equation
is equivalent to $\mathbb{P}(\tilde v = j) = \mathbb{P}(v = j/p)/(\nu p)$. It follows
that for determining the $(j_0/p - 1)$ quantities $\mathbb{P}(v = \ell)$ for
$\ell = 1, \ldots, j_0/p - 1$, we have only $j_0$ equations. The problem
is hence clearly under-determined. Some heuristics are needed
to recover the complete flow size distribution.

It has been observed in [10] that depending on the size of 
the observation window $\Delta$, the sampled flow size distribution
can locally be approximated by means of Pareto distributions. 
This leads us to make the following assumption. 

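Assumption 1: There exist some $m > 0$ and some integers
$j_0 < j_1 < \cdots < j_m = \infty$ such that for $\ell = 1, \ldots, m$ and
$j \in \{j_{\ell-1}, \ldots, j_\ell\}$, $\tilde v$ has a Pareto distribution of the form

$$\mathbb{P}(\tilde v \ge j) = \mathbb{P}(\tilde v \ge j_{\ell-1}) \left(j_{\ell-1}/j\right)^{a_\ell}$$

for some shape parameter $a_\ell > 0$.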
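
To make the piecewise form of Assumption 1 concrete, here is a small Python sketch that evaluates such a ccdf from its breakpoints and shape parameters. The function name, the breakpoint $j_1 = 30$ and the value 0.1 for $\mathbb{P}(\tilde v \ge j_0)$ are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def piecewise_pareto_ccdf(
    j: float,
    breaks: Sequence[float],
    shapes: Sequence[float],
    ccdf_at_j0: float,
) -> float:
    """P(sampled size >= j) for j >= breaks[0].

    breaks = [j_0, j_1, ..., j_{m-1}] (j_m = infinity is implicit) and
    shapes = [a_1, ..., a_m], one shape per segment.
    """
    if j < breaks[0]:
        raise ValueError("j must be at least j_0")
    value = ccdf_at_j0                      # ccdf at the start of the segment
    for idx, shape in enumerate(shapes):
        lower = breaks[idx]
        upper = breaks[idx + 1] if idx + 1 < len(breaks) else math.inf
        if j <= upper:
            return value * (lower / j) ** shape
        value *= (lower / upper) ** shape   # ccdf at the next breakpoint
    raise ValueError("j lies beyond the last segment")


# Two segments (m = 2) with hypothetical breakpoints and starting value.
breaks = [3, 30]
shapes = [0.54, 1.81]
at_j0 = piecewise_pareto_ccdf(3, breaks, shapes, ccdf_at_j0=0.1)
at_j1 = piecewise_pareto_ccdf(30, breaks, shapes, ccdf_at_j0=0.1)
beyond = piecewise_pareto_ccdf(60, breaks, shapes, ccdf_at_j0=0.1)
```

Carrying the value of the ccdf from one breakpoint to the next keeps the function continuous, so only the slope on a log-log plot changes from one segment to the next.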

When $\Delta$ is adequately chosen, the tail may be uni-modular
(i.e., m = 1), but when $\Delta$ is too large, we can have m > 1.
For the above France Telecom trace ($\Delta$ = 2 hours), m = 2 as
shown in Figure 1(b).



^{v > bf)){bo/ j)"''^ ■ Equation ([8]l can then be rewritten as 



By using Proposition 2, we deduce that for
$j_{\ell-1}/p \le j \le j_\ell/p$,

$$\mathbb{P}(v \ge j) = \nu\, \mathbb{P}(\tilde v \ge j_{\ell-1}) \left(\frac{j_{\ell-1}}{p j}\right)^{a_\ell}. \tag{9}$$

The above equation implies that $\mathbb{P}(v \ge j)$ can locally be
approximated by a Pareto distribution with shape parameter
$a_\ell$, as shown in Figure 1(a).

For inferring the quantities $\mathbb{P}(v = j)$ for $j = 1, \ldots, j_0/p - 1$,
we need more assumptions. Numerous experiments [10] have
shown that when $j \le b_0$ for some $b_0 > 0$, $\mathbb{P}(v = j)$ follows a
geometric distribution.

Assumption 2: There exists some $b_0 > 0$ such that for $1 \le
j \le b_0$, $\mathbb{P}(v = j) = (1 - r) r^{j-1}$ for some r > 0.

The above assumption is supported by experiments, as
shown in Figures 2(a) and 2(b) for the France Telecom and
Abilene traffic traces, respectively. The value $b_0 \approx 20$ has
been successfully tested in numerous experiments.
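
One simple way to check a geometric fit on a full (unsampled) trace is to regress the logarithm of the number of flows of each size on the size: for a geometric distribution the points are aligned and the slope is $\log r$. The sketch below is an illustration under that assumption, with synthetic counts; it is not the estimation procedure applied to sampled data later in this section.

```python
import math
from typing import Sequence


def fit_geometric_ratio(counts: Sequence[float]) -> float:
    """Estimate r from counts[j-1] = number of flows with exactly j packets.

    Fits a least-squares line to log(counts) against j and returns
    exp(slope), which equals r when the counts decay geometrically.
    """
    n = len(counts)
    sizes = range(1, n + 1)
    logs = [math.log(c) for c in counts]
    mean_x = sum(sizes) / n
    mean_y = sum(logs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, logs))
    variance = sum((x - mean_x) ** 2 for x in sizes)
    return math.exp(covariance / variance)


# Synthetic counts that decay exactly geometrically with ratio 0.75
# for flow sizes 1..20 (hypothetical data).
counts = [1e6 * 0.75**j for j in range(1, 21)]
r_hat = fit_geometric_ratio(counts)
```

The slope does not depend on the normalizing constant of the distribution, so the same fit applies whether the counts are raw numbers of flows or empirical probabilities.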
(a) ADSL trace. (b) Abilene trace.
Fig. 2. Ccdf of the number of packets in flows with less than $b_0 = 20$
packets.



By using Equation (9) and Assumption 2, we have the form
of the distribution for $j \le b_0$ and $j \ge j_0/p$. To fill the gap,
we use the following heuristic: $\mathbb{P}(v \ge j)$ for $b_0 \le j \le j_1/p$
has the same form as in Equation (9), namely
$\mathbb{P}(v \ge j) = \mathbb{P}(v \ge b_0)\,(b_0/j)^{a_1}$.
The shape parameters $a_\ell$ for $1 \le \ell \le m$ are determined
from the sampled flow size distribution by using standard
Maximum Likelihood Estimation (MLE) procedures. The parameter
$b_0$ is set equal to 20; this choice is purely phenomenological
but corresponds to the number of packets needed to
leave the slow start regime with a maximum window size of
32 Kbytes. The parameter $\mathbb{P}(v \ge b_0)/\nu$ is obtained by using
Proposition 2, namely by computing the ratio
$\eta = \mathbb{P}(\tilde v \ge j)/(b_0 p/j)^{a_1}$ for $j \in \{j_0, \ldots, j_1\}$, which is by assumption
independent of j. The number of flows with at least $b_0$ packets
is $K_0^+ \approx \eta K_s$. Equation (10) multiplied by $K_s$ for j = 1, 2 is
then used to compute the parameter r and the number $K_0^-$ of
flows with less than $b_0$ packets. The total number of flows is
then $K = K_0^+ + K_0^-$ and the probability of sampling a flow
is estimated by the ratio $K_s/K$.
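
The first two steps of this procedure can be sketched as follows. The shape estimator below is the textbook maximum likelihood estimator for a continuous, untruncated Pareto law, used here as a simplifying assumption in place of whatever MLE variant is applied to the discrete sampled sizes; the function names, the synthetic data and the value of $K_s$ are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def pareto_shape_mle(sizes: Iterable[float], x_min: float) -> float:
    """Continuous-Pareto MLE of the shape: n / sum(log(x / x_min)) over x >= x_min."""
    tail = [x for x in sizes if x >= x_min]
    return len(tail) / sum(math.log(x / x_min) for x in tail)


def eta_ratios(sampled_ccdf: Dict[int, float], b0: int, p: float, a1: float) -> List[float]:
    """Ratio P(sampled size >= j) / (b0 * p / j)**a1 for each available j."""
    return [ccdf / (b0 * p / j) ** a1 for j, ccdf in sorted(sampled_ccdf.items())]


# Step 1: shape estimation, checked on exact Pareto quantiles (true shape 1.5).
n = 1000
quantile_sizes = [3.0 * ((i - 0.5) / n) ** (-1.0 / 1.5) for i in range(1, n + 1)]
a_hat = pareto_shape_mle(quantile_sizes, x_min=3.0)

# Step 2: eta from a synthetic sampled ccdf that is exactly Pareto on j = 3..10.
b0, p, a1 = 20, 0.01, 0.54
sampled_ccdf = {j: 0.2 * (3 / j) ** a1 for j in range(3, 11)}
ratios = eta_ratios(sampled_ccdf, b0, p, a1)
eta = sum(ratios) / len(ratios)

# Step 3: flows with at least b0 packets, for a hypothetical K_s.
K_s = 1_000_000
flows_with_at_least_b0 = eta * K_s
```

On real data the ratios are only approximately constant in j, so averaging them over $\{j_0, \ldots, j_1\}$, as done here, is one possible choice among others.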

By using the above method for the France Telecom ADSL
trace with p = 1/100, we find $j_0 = 3$ and the estimated shape
parameters $\hat a_1 = 0.54$ and $\hat a_2 = 1.81$, which are close to the
experimental values $a_1 = 0.52$ and $a_2 = 1.81$ for the original
flow size. We then find $\mathbb{P}(v \ge b_0)/\nu \approx 0.3$ and since $K_s =$
1,120,546, we obtain the estimate $\hat K_0^+ \approx$ 336,163, while the
actual value is $K_0^+ =$ 343,004. By neglecting the term due
to flows with at least 20 packets in Equation (10), we then
find the estimate $\hat r = 0.84$ while the actual experimental value
is r = 0.75. This yields a number of flows with less than $b_0$
packets $\hat K_0^- \approx$ 20.1e6 while the actual value is $K_0^- =$ 19.8e6.
Finally, the estimated total number of flows is $\hat K =$ 20.4e6
while the actual value is K = 20.1e6 and we find the estimate
$\hat \nu = 0.054$ for the probability of sampling a flow while the
experimental value is $\nu = 0.057$.
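
The arithmetic linking these figures can be replayed directly. The short sketch below uses only the values quoted above; the variable names are ours.

```python
# Values quoted in the text for the France Telecom ADSL trace.
K_s = 1_120_546            # number of sampled flows
eta = 0.3                  # estimate of P(v >= b0) / nu
K0_minus_hat = 20.1e6      # estimated number of flows with less than b0 packets

# Flows with at least b0 packets: eta * K_s.
K0_plus_hat = eta * K_s

# Total number of flows and probability of sampling a flow.
K_hat = K0_plus_hat + K0_minus_hat
nu_hat = K_s / K_hat
```

The product $\eta K_s$ gives about 336,163 flows, the sum is about 20.4e6 flows, and the ratio $K_s/\hat K$ is close to 0.055, consistent with the estimates reported in the text.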

V. Conclusion 

We have shown in this paper by using probabilistic argu- 
ments that the original size distribution of large flows can 
be recovered from that of the sampled flow size. A critical 
parameter is nevertheless the flow sampling probability, which 
can be estimated only when the size of small flows is known. 
To overcome this problem, we argue that it is necessary to 
exploit a priori information on flows. By using this principle, 
we have shown that it is possible to recover the complete flow 
size distribution together with the number of flows. 